Patent abstract:
The present invention relates to a hardware acceleration method and system for storing and retrieving information, which implements a cortical learning algorithm over a packet-switching network. The system comprises: an encoder module that provides an SDR input and sends multicast packets to certain columnar modules connected by the packet-switching network, where each columnar module comprises in turn: a router; a plurality of memory modules configured to store the inputs received from the router and to store context information; and a calculation module that computes the overlap of the inputs, selects the memory modules with the greatest overlap, determines a temporal context for the selected memory modules and sends a prediction of the system output to a classifier module, which selects an output of the system from a group of pre-established outputs based on said prediction. (Machine translation by Google Translate, not legally binding)
Publication number: ES2558952A1
Application number: ES201500841
Filing date: 2015-11-20
Publication date: 2016-02-09
Inventors: José Ángel GREGORIO MONASTERIO; Valentín PUENTE VARONA
Applicant: Universidad de Cantabria
IPC main class:
Patent description:


SCALABLE HARDWARE ACCELERATION SYSTEM AND METHOD FOR STORING AND RETRIEVING INFORMATION


DESCRIPTION
TECHNICAL FIELD OF THE INVENTION
The present invention relates to the technical field of artificial intelligence and more specifically to artificial neural networks implemented in hardware to store and retrieve information.
BACKGROUND
The application of algorithms based on neural networks consists in the automatic processing of information inspired by the way the nervous system of animals, its neurons and their connections, works. Neurons can be distinguished in groups arranged in columns, which are connected through short-range axons to other nearby columns (through cortical layer I) or to other distant columns and to the sensorimotor interface, that is, the thalamus (through layer VI). Figure 1 represents these structures of cortical micro-columns (1) and cortical hyper-columns (2). Regardless of the functionality of each zone, the cortex is morphologically very regular.
Empirical evidence suggests that the neuronal system stores and retrieves information using a sparse distributed representation (SDR). In this representation, in contrast to the conventional binary representation of data (also termed a localist representation), each bit has semantic meaning, and the data representation is highly resistant to noise and tolerant of failures (as the biological one is); that is, the unwanted change of a small number of bits in the original representation always produces a value similar to the original.
The cortex can be understood to function as a self-associative memory, hierarchically structured as a hierarchical temporal memory (HTM). This hypothesis, based purely on neuroscientific observations, has led to a precise algorithm, called the cortical learning algorithm (CLA), which provides the rules for storing and retrieving information, that is, for learning and making predictions. This concept has been applied to practical problems such as anomaly detection, sequence prediction, pattern identification, etc., imitating the behavior of the upper layers of the cortical column.
The CLA algorithm focuses on partially replicating the functionality of cortical micro-columns, where layer I is mainly used for interconnecting the different columns in the same hyper-column; levels II/III, generally referred to as the inference layer, are supposedly dedicated to predicting the state of the column at the next steps of the input; and layer IV, called the sensory layer, deals with the input signals to the column. The operating principles of layers V and VI are not yet well understood and CLA currently does not model them, but the key point of this organization is that the same hyper-column can be reused by different hyper-columns at the next level and, throughout the entire hierarchy, the level of information that a column can identify increases (condensing the semantics of the lowest levels).
The CLA algorithm defines the term column (20), represented in Figure 2, which is sufficient to handle prediction without the hierarchical structure. At the bottom, a proximal dendritic segment (21) can be connected to a subset of the bits of the SDR input. This restriction models the fact that the activity of an input axon will be observed by only a subset of columns. These segments model the dendritic growth of the system's feed-forward connections, which is well known to be responsible for learning in the cortex. In contrast to other artificial neural networks, each segment synapse is characterized by a binary value, that is, it is either connected or not. For a given encoded input, the number of active synapses in each proximal dendritic segment is determined, that is, the number of active inputs connected to the segment through a connected synapse (this is called the input overlap). Once this is known, as in biological systems, an inhibition process begins and only about 2% of the columns, those with the most active synapses, are selected. The remaining columns are inhibited. The synapses that have been activated by the input in the winning columns are strengthened, and the synapses connected to the inactive inputs are weakened. In order to manage learning, a permanence value is monitored for each synaptic connection. If the value is above a predefined threshold, the synapse is considered connected. At boot time, the values are chosen at random, close to the threshold value. Typically, three or four bits are sufficient. In software implementations, by default, the threshold can be 0.2, with a maximum of 1.0 and a learning increment of 0.1 (damping, or forgetting, is usually an order of magnitude smaller, but the increased resolution can be avoided by using a random subtraction). In this way, layer IV of the cortical micro-columns is emulated, which in the CLA/HTM terminology is called spatial pooling. The intuition behind this grouping is to "filter" the most salient characteristics of the input in order to subsequently store the sequence.
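As an aid to the reader, the spatial pooling rules just described can be captured in a few lines of Python. This is a minimal sketch, not the patented hardware: the data layout and the toy dimensions are our own assumptions, while the threshold (0.2), the increment (0.1), the order-of-magnitude-smaller forgetting and the 2% inhibition rule follow the text.

import random

CONNECT_THRESHOLD = 0.2          # permanence above this means "connected"
PERM_INC, PERM_DEC = 0.1, 0.01   # learning increment; forgetting ~10x smaller
WINNER_FRACTION = 0.02           # ~2% of columns survive inhibition

def overlap(permanences, active_inputs):
    # Count active inputs seen through a connected synapse.
    return sum(1 for i in active_inputs
               if permanences.get(i, 0.0) >= CONNECT_THRESHOLD)

def spatial_pooling(columns, active_inputs):
    # columns: one dict per column, mapping input-bit index -> permanence.
    scores = [overlap(c, active_inputs) for c in columns]
    n_winners = max(1, int(WINNER_FRACTION * len(columns)))
    winners = sorted(range(len(columns)), key=lambda k: scores[k],
                     reverse=True)[:n_winners]
    for k in winners:                       # learning only in winning columns
        for i in columns[k]:
            if i in active_inputs:          # strengthen active synapses
                columns[k][i] = min(1.0, columns[k][i] + PERM_INC)
            else:                           # weaken synapses to inactive inputs
                columns[k][i] = max(0.0, columns[k][i] - PERM_DEC)
    return winners

# Toy usage: 64 columns, each potentially connected to 16 of 128 input bits,
# with permanences initialised at random close to the threshold.
cols = [{i: random.uniform(0.15, 0.25) for i in random.sample(range(128), 16)}
        for _ in range(64)]
print(spatial_pooling(cols, set(random.sample(range(128), 8))))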
On the other hand, when a column is activated, that is, when it wins the inhibition process, its (temporal) cells have to process that information. Each column will have a few tens of cells (22). A number of SDR-compatible columns will represent an encoded input. Therefore, after inhibition, the winning columns represent the most salient characteristics of the input. In a "context-free" environment, a single cell would be enough to make the prediction. However, to obtain a prediction that depends on the context, both a current value (26) and a temporal sequence context are needed. To do this, each cell in a column represents the value of the input in a time sequence (that is, the memory must be able to predict successive sequences). Even with a low number of cells per column, the number of "contexts" that the system can store for the same value is huge. For example, a system with 2048 columns and 32 cells per column will be able to capture 4032 different temporal contexts for the same input.
Each cell can predict the status of the column at the next input in the sequence. To do this, it uses dendritic segments to model the relationships between columns. Each distal dendritic segment (23) stores potential synapses with other cortex columns. The rules for handling such synapses are similar to those of the proximal segment. If any of the cell's segments reaches a given threshold, the cell enters the predictive state (24), which means that said column is expected to be activated (25) in the next time step. When a column was not predicted correctly, all the cells in the column attempt to connect to the previously seen sequence. First, new distal segments are constructed on the fly according to previous remote activations and, second, cells are sought that should predict the activation in the next time step, which mimics layers II/III in the biological columns. The intuition is to use the synapses between the different columns in the system to obtain a meandering path between the cells that represent the different temporal contexts. The CLA/HTM terminology for this task is temporal pooling.
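The temporal pooling behaviour of a single column, including the burst on a misprediction, can be sketched as follows. This is an illustrative Python fragment under simplifying assumptions (one distal segment per cell, an arbitrary activation threshold); it is not the patented logic.

SEGMENT_THRESHOLD = 3   # illustrative activation threshold, not from the patent

def step_column(col_id, segments, prev_active):
    # segments: one distal segment per cell, as a set of presynaptic
    # (column, cell) identifiers; prev_active: cells active at t-1.
    predicted = {c for c, seg in enumerate(segments)
                 if len(seg & prev_active) >= SEGMENT_THRESHOLD}
    if predicted:
        # Correct prediction: only the predicted cells become active, and
        # their matching segments would be reinforced (dendritic growth).
        return {(col_id, c) for c in predicted}
    # Misprediction: the column bursts, every cell becomes active, and one
    # cell would be chosen to grow a new segment toward prev_active.
    return {(col_id, c) for c in range(len(segments))}

# Toy usage: a 4-cell column whose cell 2 has learned one temporal context.
segments = [set(), set(), {(7, 0), (9, 3), (11, 1)}, set()]
print(step_column(5, segments, prev_active={(7, 0), (9, 3), (11, 1)}))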
Today, the advances made in HTM are implemented in software, which technically limits the systems to a few thousand columns. Instead of precise weighted connections, HTM uses a complex dynamic topology to store and retrieve information, which, from a simplistic hardware perspective, is not feasible (a single column can potentially be related to tens of thousands of different columns). Existing solutions are very demanding in memory and require millions of clock cycles to produce each prediction. Problems such as pattern recognition based on the saccadic mechanism require much larger and faster systems and, although efforts are being made on FPGA-based approaches or emerging technologies such as 3D stacking and non-volatile memories that could somehow alleviate these strict requirements, the state of the art would welcome as a valuable contribution any solution presenting a feasible hardware implementation to overcome this problem and reduce costs and execution time.
DESCRIPTION OF THE INVENTION
The present invention solves the problems mentioned above by presenting the architecture of a hardware implementation that employs architectural techniques and methodologies used in general-purpose chips and multiprocessors. Specifically, the limitations of the software implementations of the CLA cortical learning algorithm known in the state of the art are overcome by the present invention, which refers in a first aspect to a hardware acceleration system for storing and retrieving information that implements said cortical learning algorithm through a packet-switching network. The system includes:
at least one encoder module configured to encode a binary input into a sparse distributed representation (SDR) and to send, for each active bit of the SDR, a multicast packet to a given columnar module through the packet-switching network, based on a previously established correspondence table;
a plurality of columnar modules connected by said packet-switching network, configured to receive the multicast packets sent from the encoder, where each of the columnar modules includes:
o a router with multicast support, configured to receive packets from the encoder module, deliver said packets to certain memory modules of the columnar module, and send packets from the memory modules to an output classifier;
o a plurality of memory modules configured to store the inputs received from the router and to store context information;
o a calculation module configured to determine a degree of overlap between the content of certain memory modules and the current input, select a specific number of memory modules with the greatest degree of overlap, determine a temporal context for each of the selected memory modules, make a prediction of the system output based on the current input and the temporal context information, and send an output packet containing said prediction to an output classifier module;
an output classifier module configured to receive an output packet, sent through the switching network from any of the columnar modules, and to select a system output from a group of preset outputs based on the received output packet.
The system of the present invention, according to one of its particular embodiments, contemplates that the calculation module comprises a comparator, an adder and a counter.
The present invention contemplates, in one of its embodiments, that each memory module of the plurality of memory modules comprises a plurality of temporal cells, which adopt an active or a non-active state, and whose combination represents a certain temporal context for the memory module. Advantageously, the representation of different temporal contexts that allow predicting the following inputs is achieved and, in addition, the sequences that the system can store contribute directly to learning for future inputs.
Additionally, the present invention contemplates that the calculation module is configured to check whether its output prediction is correct; in the case of a wrong prediction, a burst occurs that puts all the temporal cells of the memory module in the active state. Thus, the learning of the system is advantageously tuned.
Optionally, the present invention, according to one of its embodiments, contemplates that the calculation module is further configured to pipeline stages and, given an input sequence, produce a prediction within three intervals of said sequence. Advantageously, the capabilities of the network are thus exploited and the CLA algorithm can be segmented to inject the results of each stage into the network without having to wait for all the calculation phases to finish.
Additionally, one of the embodiments of the present invention contemplates the possibility that the calculation module is also configured to aggregate traffic from different stages into the same packet. This is a further optimization that the present invention can incorporate to enhance the advantages of the segmentation of the algorithm discussed above.
The columnar modules located at the ends of the network, according to one of the embodiments of the invention, are configured to inject a broom packet into the packet-switching network, which is replicated in the remaining columnar modules only when the corresponding router has no more queued packets, until the broom packet reaches the opposite end of the network, indicating that the network has been emptied. This advantageously serves as a mechanism to ensure the correct execution of the overlap calculation steps and the determination of the temporal context.
One of the particular embodiments of the invention contemplates that the number of memory modules comprised in each of the columnar modules is determined by a balance between the propagation delay and the system clock cycle.
The encoder module of the present invention, or one of the encoder modules, can be configured to send the input packets to a randomly preset selection of columnar modules representing about 20% of the total columnar modules. Thus, another optimization of the present invention is advantageously provided where, depending on the inputs or the specific application, the size of the selection, or proximal patch, can be dynamically varied.
The system proposed by the present invention is implemented, according to different particular embodiments, on a silicon wafer, a chip or a microprocessor using CMOS technology.
A second aspect of the invention relates to a scalable hardware acceleration method for storing and retrieving information through a packet-switching network, the method comprising the steps of:
a) encoding, in an encoder module, a binary input into a sparse distributed representation (SDR);
b) sending, for each active bit of the SDR, a multicast packet from the encoder module to a given columnar module of a plurality of columnar modules, through the packet-switching network, based on a previously established correspondence table;
c) receiving the packets sent from the encoder module, through the packet-switching network, in a router of a columnar module;
d) delivering said packets to certain memory modules of the columnar module;
e) storing the received packets in certain memory modules;
f) determining, in a calculation module of the columnar module, a degree of overlap between the contents of the memory modules that have received the input packet and the current input;
g) selecting, by the calculation module, a certain number of memory modules with the greatest degree of overlap;
h) determining, by the calculation module, a temporal context for each of the selected memory modules;
i) making, by the calculation module, a prediction of the system output based on the current input and the temporal context information stored in the memory modules;
j) sending an output packet containing this prediction to an output classifier module;
k) receiving an output packet in the output classifier, sent through the switching network from any of the columnar modules;
l) selecting, in the output classifier, a system output from a group of preset outputs depending on the received output packet.
According to one of the embodiments of the present invention, the proposed method contemplates checking whether the output prediction made by the calculation module is correct, where, in the case of a wrong prediction, a burst occurs that puts all the temporal cells of the memory module in the active state.
Additionally, the present invention may include the step of verifying that the packet-switching network is empty before executing the steps of calculating the overlap and determining the temporal context, where, to verify that the network is empty, a broom packet is provided that runs through the packet-switching network.
Optionally, the present invention contemplates in one of its embodiments the step of restricting the packets sent by the encoder module to a selection of columnar modules, randomly preset, which represents about 20% of the total columnar modules.
Inspired by the biological properties of axons and dendrites, the present invention thus defines a system that uses a logical construction to satisfy the topological flexibility of the known CLA algorithm through an on-chip network. Unlike other state-of-the-art learning systems, the calculations of the CLA algorithm are simple (low-precision addition and subtraction, simple comparisons), so that, by adding some calculation logic to the routers of that network and some memory modules for storing the connectivity state, the present invention implements the CLA algorithm without the need for complex general-purpose processors. The communication substrate, and the procedures for achieving a feasible hardware implementation of the known CLA algorithm, are based on the use of a packet-switching network and various techniques used in computer architecture, which also guarantee the scalability of the system. The combination of all the techniques presented in the present invention makes it possible to reduce the network delay and the required energy by approximately 95% on average.
In addition, the hardware implementation proposed by the present invention brings a multitude of additional advantages, such as extending the spectrum of application of the algorithm by allowing, for example, easy combination with von Neumann-type computing, its use as a neural processing accelerator, exploration of the potential of the hierarchical organization, or research on the still unknown underlying mechanisms of the neocortex. Therefore, a silicon-based implementation as proposed in the present invention is a valuable contribution to the state of the art.
DESCRIPTION OF THE DRAWINGS
To complement the description being made, and in order to help a better understanding of the characteristics of the invention, a set of figures is attached as an integral part of said description, where, for illustrative and non-limiting purposes, the following is represented:
Figures 1a, 1b.- represent structures of micro-cortical columns (1a) and structures of cortical hyper-columns (1b) on which the present invention is based.
Figure 2.- represents one of the columns according to the CLA algorithm.
Figure 3a.- represents a high level description of the architecture proposed by one of the embodiments of the present invention.
Figure 3b.- represents a high-level sketch of one of the columnar modules of Figure 3a.
Figure 4.- represents the steps necessary for the CLA algorithm.
Figure 5.- represents the segmentation of the CLA algorithm and how the stages overlap.
Figure 6.- represents an example of optimization according to one of the embodiments of the invention, where a proximal patch is shown in a honeycomb topology.
Figure 7.- represents an example of optimization according to one of the embodiments of the invention, where several scale-out areas are shown.
Figure 8.- graphically represents the number of clock cycles, per interval, for different sizes of 2D square mesh.
Figure 9.- graphically represents the number of clock cycles, per interval, for different 2D square meshes, using aggregated traffic.
Figure 10.- graphically represents the number of clock cycles, per interval, for different 2D square meshes, with the segmented algorithm, traffic aggregation and proximal patches applied.
Figure 11.- graphically represents the number of clock cycles, varying the width of the link.
Figure 12.- graphically depicts the clock cycles required by the network to process an interval with different link widths.
Figure 13.- graphically represents the dynamic energy requirements of the network to process an interval from the input stream.
Figure 14.- graphically represents the cycles, normalized to the base algorithm, per input interval (16x16 mesh).
Figure 15.- graphically represents the dynamic energy of the network per interval, normalized to the base algorithm.
Figure 16.- graphically represents the probability of mispredicted columns, normalized to the base algorithm.
DETAILED DESCRIPTION OF THE INVENTION
What is defined in this detailed description is provided to help a thorough understanding of the invention. Accordingly, persons of ordinary skill in the art will recognize that variations, changes and modifications of the embodiments described herein are possible without departing from the scope of the invention. In addition, the description of functions and elements well known in the state of the art is omitted for clarity and conciseness.
Of course, the embodiments of the invention can be implemented in a wide variety of architectural platforms, protocols, devices and systems, so the specific designs and implementations, presented in this document, are provided solely for purposes of illustration and understanding, and never to limit aspects of the invention.
The present invention discloses the implementation of a hardware accelerator based on the cortical learning algorithm for storing and retrieving information, where details, from the perspective of computer architecture, are given below.
The basic assumption of HTM memories and CLA algorithms is that synaptic plasticity (through dendritic growth) is the key element of the cortex for learning. This assumes that information is stored in the relationships between the columns, dynamically defined by the connections established during learning. Therefore, the storage capacity is proportional to the product of the number of columns and the maximum number of connections per column.
Although neuron connectivity can potentially be very high (dendritic arborizations can provide up to tens of thousands of potential synapses), many of these synapses are not active (i.e., the pre-synaptic axon is too distant from the dendrite) or multiple active synapses correspond to the same pair of neurons (as a redundancy mechanism). Therefore, instead of electrically replicating the morphology of biological systems, which would currently be impossible, the present invention introduces such functionality into a packet-switching network.
It is mainly the communication substrate that is organized and optimized to emulate axon activity and correctly apply the prediction and learning algorithms of HTM/CLA. The present invention, instead of using synapses to establish an active connection between two columns, uses memory structures associated with a plurality of routers to model said dendritic segments, and uses simple calculation logic to perform the tasks of spatial pooling and temporal pooling.
Figure 3a presents a high-level description of the proposed architecture, where an encoder (31) can be identified at the system input, responsible for converting a localist input into an SDR representation; and a classifier (32) at the output of the system, responsible for carrying out the intended purpose, for example detecting anomalies in an input sequence, predicting the next value in the input sequence, comparing certain patterns, etc. Between the encoder and classifier modules, the CLA mechanics is implemented through a component that, in this document, will be referred to as Columnar Core (CC) or columnar module (33). In one of the particular embodiments, illustrated by Figure 3a for explanatory purposes only, a system of 16 columnar modules, CC0-CC15, is connected via a packet-switching network, for example with a square-mesh topology, but configurations of other dimensions would be equally possible, taking advantage of one of the greatest advantages of the present invention: its scalability.
Figure 3b represents a high-level sketch of a CC. In this particular case, it is assumed that each CC has B columns and t temporal cells per column. The system is homogeneous, as the biological cortex is, and the present invention, for its hardware implementation, takes into account the following requirements in three different areas: communication (34), calculation (35) and memory (36):
A. Communication Requirements
The interconnection network has to handle all the traffic generated by the CLA algorithm, that is, the incoming traffic from the encoder, the inhibition traffic and the lateral activity of the activations of the temporal cells (38), as well as the sending of the activation pattern to the classifier. Such activity is carried out, in the present invention, at the logical level, using packets instead of physical wires. For example, according to one of the embodiments, each output bit of the encoder is connected to a set of statically defined columns (37) so that, for a given input, each of the active bits in the SDR representation will send a multicast packet to the CCs where the columns, or memory modules, that have to receive it reside. The encoder has, in one of the embodiments, a table that relates columns and inputs. Therefore, the multicast packet used by the present invention emulates the activity of each axon. Similarly, when a column reaches a predictive state, a single packet will be sent to the classifier.
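A behavioural sketch of this encoder-side table, in Python, could look as follows; the packet fields and the random boot-time mapping are illustrative assumptions, not the patent's exact format.

import random
from collections import namedtuple

MulticastPacket = namedtuple("MulticastPacket", ["source_bit", "dest_ccs"])

def build_correspondence_table(n_input_bits, n_ccs, fanout):
    # Statically (at boot time) map each encoder output bit to a fixed set
    # of columnar cores (CCs) hosting its potentially connected columns.
    return {bit: frozenset(random.sample(range(n_ccs), fanout))
            for bit in range(n_input_bits)}

def emit_axon_activity(sdr_active_bits, table):
    # One multicast packet per active SDR bit: the logical equivalent of an
    # axon firing toward its target columns.
    return [MulticastPacket(bit, table[bit]) for bit in sdr_active_bits]

table = build_correspondence_table(n_input_bits=2048, n_ccs=16, fanout=3)
packets = emit_axon_activity({5, 17, 900}, table)
print(packets)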
Internally, the router (39) will receive logic inputs from a calculation module (35) for the spatial pooler (the column overlap used in the inhibition procedure) and for the temporal pooler (cell activation events). These inputs must be sent to the potential packet recipients. In the state-of-the-art CLA algorithms implemented in software it is assumed that, in most cases, all the columns of the system must be aware of this information, that is, the potential receivers are all the columns. For example, in global inhibition (which is the default method), every column must be aware of the overlap of the rest of the columns with the input. The overlap is calculated as the number of connected synapses in the proximal segment of the column for a given input. With this type of information, the calculation logic can determine whether the current column is within the set of 2% with the greatest overlap and feed the temporal pooling logic. Similarly, for the construction of distal segments, although probabilistically limited, the algorithm assumes that each column is aware of all the (temporal) cells in a predictive state, which is equivalent to assuming that the effects of the axons are diffused to all the cells in the system.
There is therefore a large amount of multicast traffic, which requires considerable network bandwidth and remarkable power consumption. In addition, any of the computations performed in the calculation part can access only local information, so no centralized component can be relied upon to extend the system to thousands of CCs and, therefore, system synchronization is a challenge.
B. Computational Requirements
Most of the axon activity, as already mentioned, is modeled according to the present invention as multicast traffic. The calculation logic at the destination will be responsible for performing the prediction and learning procedures for each incoming packet.
There are two stages in the CLA algorithm that have to be applied sequentially, once all the traffic in the current cycle has been drained from the network:
- The spatial pooling, where the calculation module will evaluate the overlap of the input with its proximal segment (that is, the number of active synapses) and, assuming that the inhibition is global, will spread its value to the rest of the columns in the system. In the case of local inhibition, a multicast packet will be used. Each column is aware of its own overlap; with a simple comparison against each incoming packet it will know, once all traffic is drained, whether it is among the 2% most active. The synapses in the proximal-segment table corresponding to the active inputs will be updated if the column was active. Therefore, according to one of the embodiments of the invention, the calculation module comprises, to perform this spatial logic, a comparator, a 4-bit adder and a counter. Note that the maximum overlap requires ~log2(inputs) bits. For a 2048-input encoder, 12 bits are sufficient.
- The temporal pooling, where the calculation module will evaluate any lateral activity. Assuming that the axon of the (temporal) cells is global, a broadcast will be generated. The input packet will include the source column and temporal cell. This is maintained in a list of current activations. Once the current cycle is finished, the logic will determine, for each column, whether the activation was correctly predicted. In that case, the corresponding distal segment of the temporal cell in the predictive state will be updated accordingly (this emulates the growth of dendrites). If the column was not predicted correctly, the logic must maintain the activations of the previous cycle to search for the nearest distal segment (or create a new one if it did not exist). From the hardware perspective, this requires an extensive search across all the dendritic segments of the column to determine which of them are active. The (temporal) cells with an active dendritic segment will generate a broadcast/multicast in the network and, finally, the columns that were not correctly predicted will produce a burst, as the biological system does, which is equivalent to putting all the temporal cells of the column in the active state (selecting only one to perform the learning).
C. Memory Requirements
The proximal segments store the permanence of the synapses with each potentially connected input bit. It should be noted that each bit of the SDR representation produced by the encoder is potentially connected (i.e., a synapse could be formed) to the subset of columns selected at startup (which can be selected uniformly). In general, it could be assumed that each bit can be connected to any column in the system and, therefore, the proximal segment would have to have an entry for each potential input; but, according to the present invention, each column will be connected (i.e., a synapse will form) to a very small subset of encoder inputs, as is the case in practice. Therefore, the proximal segment is structured, according to one of the embodiments, as a conventional cache memory indexed by the input index. In practice, a capacity of 64-128 entries seems to be sufficient for a 2K-column system. The permanence value needs to be stored there. As in biological systems, the accuracy required by the algorithm is low (typically fewer than 4 bits are sufficient). For example, assuming a 2K-column system with 1K inputs, the aggregation of all the proximal segments of the cortex will require (including tags) between 0.25MB and 0.5MB (12 bits · 64 · 2K and 12 bits · 128 · 2K), so that, from a hardware implementation perspective, the task of manipulating these segments seems simple.
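The following back-of-the-envelope check, in Python, reproduces this sizing arithmetic; note that the raw products (about 0.19 MB and 0.38 MB) sit slightly below the quoted 0.25-0.5 MB range, which presumably allows for additional tag and control bits.

def proximal_storage_mbytes(n_columns=2048, entries=64, bits_per_entry=12):
    # bits_per_entry covers the permanence plus the cache tag ("label"),
    # following the 12-bit figure used in the text's own arithmetic.
    return n_columns * entries * bits_per_entry / 8 / 2**20

print(proximal_storage_mbytes(entries=64))    # ~0.19 MB for 64-entry caches
print(proximal_storage_mbytes(entries=128))   # ~0.38 MB for 128-entry caches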
In contrast, the distal segments seem significantly more difficult to handle. In a simplistic approach, each distal segment will require as many synapses as there are columns in the system. In addition, each temporal cell may require multiple segments (typically in the range of 128 to 256). For a system with 2K columns, with 32 temporal cells each and 256 segments per cell, assuming 4 bits to store the permanence, the segments of each cell will require 8MB. Therefore, the total memory required by the system would be prohibitive for a real physical system. However, as in biology, only a few of the potential connections are required. For example, by restricting each segment to the most active synapses (using, for example, a stack-based approach), the number of potential connections that must be reserved can be greatly reduced.
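The quoted figure can be reproduced with the following arithmetic, under our reading that each of the 256 segments can hold a potential synapse to every cell in the system (2048 columns x 32 cells) at 4 bits each:

def distal_mbytes_per_cell(n_columns=2048, cells_per_col=32,
                           segments=256, bits_per_synapse=4):
    potential_targets = n_columns * cells_per_col   # every cell in the system
    return segments * potential_targets * bits_per_synapse / 8 / 2**20

print(distal_mbytes_per_cell())        # 8.0 MB per cell: prohibitive in total

# Capping each segment at, say, its 32 most active synapses (the stack-based
# approach suggested above) shrinks the per-cell storage dramatically:
print(256 * 32 * 4 / 8 / 1024, "KB")   # 4.0 KB per cell under that cap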
Therefore, three problems have been identified for the columnar modules: communication and synchronization, the complexity of the temporal pooling logic, and the organization of the distal segments. From the point of view of scalability, and therefore for the present invention, the most relevant is the first. Given that, in biological systems, the fundamental difference between species seems to be dominated by the number of neurons and not by the number of synapses per neuron, it seems appropriate to think that the problems of the internal logic of the CCs are not an important inconvenience, since the tables, and the time required by the calculation logic for the temporal pooling, do not need to scale with the total number of columns.
On the other hand, it is evident that, by increasing the number of columns in the CLA algorithm, the requirements for communicating and synchronizing the different CCs will be substantially higher. The communication substrate will be responsible for facilitating the learning of new temporal and spatial patterns and, from this perspective, the most demanding problem for the present invention is addressed: the communication substrate needed to model axon activity and to synchronize the CCs in an efficient and fast way. The key aspects of this communication substrate are detailed below:
A. Characteristics of the Network
Since all axon activity is modeled as multicast packets, the router used by the present invention requires multicast support. With network support, the energy needs of the multicast packets will be lower, since the copy of the packet is made near the destination; a lower latency is also obtained, since replication at injection is not necessary and, therefore, introduces no delay.
The required packet size is quite small. For example, the inhibition traffic will require the identification of the source column and the overlap (log2(NumColumns) + log2(NumEncoderInputs)). The lateral activity will require the source column and the identification of the temporal cell (log2(NumColumns) + log2(NumTemporalCells)). The input activity will require the source identification (log2(NumInputs)). For a system of 2,048 columns/inputs, with 32 temporal cells per column, the required sizes will be 22, 16 and 11 bits respectively, as verified in the sketch below. Therefore, some of the embodiments of the invention contemplate narrow communication links, which further decrease the energy needs and the cost of the router.
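These payload widths follow from a one-line calculation per traffic class; the variable names below are ours.

from math import log2

n_columns = n_inputs = 2048
n_cells = 32

inhibition = int(log2(n_columns)) + int(log2(n_inputs))  # source column + overlap
lateral    = int(log2(n_columns)) + int(log2(n_cells))   # source column + cell id
proximal   = int(log2(n_inputs))                         # source input bit

print(inhibition, lateral, proximal)   # 22 16 11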
According to one of the particular embodiments of the invention, assuming links log2(NumColumns) bits wide, packets composed of a single flit (flow-control digit) can be used in most cases and, under such circumstances, the storage needs within the routers will be low.
During steady state, approximately 2% of the cortex columns will have activity. Therefore, one of the embodiments of the present invention contemplates a low-degree network with narrow links to meet the requirements. High-degree networks would require increasing the complexity of the routers and the cost of the wiring. For example, two-dimensional tori or meshes, or even honeycomb networks, could meet such requirements. Another embodiment of the invention, in which a fault-tolerant network is assumed, contemplates increasing the size of the system without yield problems due to production defects, and even the use of wafer-to-wafer integration techniques in a 3D environment.
B. Synchronization
The CLA algorithm mainly comprises four phases: calculating the overlap of the proximal dendrites with the current encoded input, determining the winning cortex columns, determining the lateral activity in each temporal cell of the column, and producing the prediction. Overlapped with these phases, the adaptation (that is, the learning) of the synaptic segments is carried out.
The difficulty of executing these phases in a fully distributed way lies in knowing when each one should be executed. For example, the determination of the input overlap should not be executed until all the input activity has been received (that is, until each column is aware of all the axon activity). Since there is no confirmation message for the reception of axon activity, each CC must be aware of when to execute the corresponding part of the algorithm. Similarly, the inhibition cannot be activated until each column is aware of whether it is among the most active, and, finally, the prediction cannot be made until the lateral activity of the related temporal cells is known. The simplest but effective way to avoid this problem is to empty the content of the network before moving on to the next phase. If the network is empty, there is a guarantee that all the relevant packets will have reached their destination.
Figure 4 details all the steps necessary for the CLA algorithm. In addition to coding (40) and classification (50), there are nine additional stages: three of them perform the calculation of the spatial and temporal logic (43, 46 and 49), three correspond to the activity of the axon (41, 44 and 47) and, finally, three others are necessary to drain the network (42, 45 and 48).
The problem of synchronization in the present invention is then reduced to providing a scalable network drainage mechanism. To ensure the scalability of such a mechanism, a simple and effective way of doing it within the network itself is needed. The present invention contemplates, according to one of the embodiments, using dimension-order routing to inject a special broadcast packet, called a broom packet, into the CCs at the ends of the network, corresponding to the smallest and largest identifiers (IDs) (in the example in Figure 3a they would correspond to CC0 and CC15). Each broom packet will be allowed to pass to the next router only if the local router has no more packets and the transit buffers of the ports through which the router has received the copies of the packet are empty. The packet is replicated on all the remaining ports. For example, when CC5 receives, from CC4 and CC1, the broom packet of CC0, it is known that there are no activity packets that can affect the columns that CC5 handles. When the West and North transit queues are empty, the router replicates the CC0 broom packet through the South and East ports. This operation is applied throughout the cortex until the CC15 core receives the CC0 broom packet. At this point, CC15 knows that there are no longer any packets in the network addressed to it and can advance to the next stage of the algorithm. Similarly, when an intermediate CC has received all the broom packets of CC0 and CC15, it knows that there are no pending packets in the network for it. It should be noted that this mechanism operates in a completely distributed manner and scales with the available bandwidth of the network.
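The forwarding rule of the drainage mechanism can be expressed compactly. The following Python sketch models the decision at a single router of a 2D mesh; the port naming and queue model are illustrative assumptions, not the patented circuit.

def can_forward_broom(local_queue_empty, transit_queue_depths, arrival_ports):
    # A broom packet is forwarded (and replicated on the remaining ports)
    # only when the local router holds no more packets and the transit
    # buffers of every port on which a copy of the broom arrived are empty.
    return local_queue_empty and all(transit_queue_depths[p] == 0
                                     for p in arrival_ports)

# Example: CC5 has received CC0's broom from West and North; it replicates
# it toward South and East only once those two input queues have drained.
queues = {"west": 0, "north": 0, "south": 2, "east": 1}
print(can_forward_broom(True, queues, arrival_ports=("west", "north")))  # True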
In biological systems, such drainages do not appear to be necessary because the rate of arrival of changes at the input is sufficiently spaced to ensure that the spatial and temporal activity is performed satisfactorily. When the input rate is too high, the system will be unable to learn or predict. As a simple example, an excessively rapid change of an image will be perceived as noise by the visual cortex. Although a similar solution could be applied to the present invention, the encoder and the data are not as finely tuned as in biological systems, which makes it more than advisable to incorporate the proposed network drainage solution.
C. Algorithm Segmentation
The stages of the algorithm require a substantial amount of time and energy. However, as can be seen in Figure 4, stages can be identified as in the case of a general-purpose processor. Therefore, the present invention uses the same optimization techniques employed there. In particular, according to one of the embodiments, the algorithm is segmented to combine activities of different stages and reduce their number to only three intervals for each input datum. Figure 5 shows how this organization is beneficial once the pipeline is loaded. The idea is to start calculating the overlap of the next input as soon as the current overlap has been calculated. Then, in interval 54, two operations are carried out in the network simultaneously. Moving forward in time, three different input operations can be overlapped in a single stage. In interval 57, the communication of the distal activity of the first input value, the inhibition traffic of the second input value and the proximal activity of the third input value are being transmitted. In interval 59, the prediction for the first input, the calculation of the lateral activity for the second and the calculation of the overlap for the third are carried out simultaneously. More importantly, only one drain of the network is needed for each input value. Once the pipeline is loaded, only three intervals of the input sequence are needed to produce a prediction.
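The schedule that results from this segmentation can be illustrated with a short script (purely descriptive, not part of the invention), where each input value traverses the three network stages in consecutive intervals:

STAGES = ["proximal activity", "inhibition traffic", "distal activity"]

def pipeline_schedule(n_inputs):
    # Return, per interval, the (input index, stage) pairs in flight.
    schedule = []
    for t in range(n_inputs + len(STAGES) - 1):
        in_flight = [(t - s, STAGES[s]) for s in range(len(STAGES))
                     if 0 <= t - s < n_inputs]
        schedule.append(in_flight)
    return schedule

for t, active in enumerate(pipeline_schedule(5)):
    print(f"interval {t}: {active}")
# From interval 2 onward, three consecutive inputs are being processed
# simultaneously, and a single network drain per input value suffices.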
D. Communication and Computation Overlap
Overlapping the algorithm stages, as explained above, opens the possibility of further improvements, since it is not necessary to complete the calculation phases (that is, the calculation of the overlap, the lateral activity and the prediction) before starting to send the result of each one. As soon as the calculation logic begins to generate axon activity, it can be injected into the network. Therefore, the number of clock cycles necessary to process a value in the input sequence will be determined by the slowest part: communication or calculation. The number of cycles required by the slowest part and the clock cycle time will determine the time required to process a sample of the input sequence. Finally, the network drainage must be carefully managed: broom packets are sent to the router of each CC only if all the local columns have completed the actions to be performed in the current interval.
E. Traffic Aggregation
The optimal organization, according to one of the embodiments of the present invention and from the point of view of latency, is the combination of several columns in a single CC. Using many routers with very short links can unnecessarily increase the average latency in the network. To optimize this latency, the size of the CC (that is, the number of columns it handles) must be adjusted so that the propagation delay and the network clock cycle are similar. With this approach it is possible to aggregate multiple axon activations from columns in the same CC into a single packet. Although this could increase the number of flits in a packet (its length), it will significantly reduce the network load.
Additionally, the segmented algorithm allows combining, in a single packet, actions from different stages of the algorithm. For example, the inhibition information can be combined with that of the lateral activations of the previous epoch and thus grouped into a single packet. To carry this out, the existence of coalescing injection queues is assumed (similar to the structures that support non-blocking caches, usually called MSHRs, miss information/status handling registers), where any newly injected packet is checked against those waiting to be injected. If there is a match in the target mask, the previous packet is modified to contain the information of the one that has just arrived and, thus, the new packet can be discarded.
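A behavioural sketch of such a coalescing injection queue, in Python, might look as follows; the class and field names are our own, and only the match-on-destination-mask rule comes from the text.

class AggregatingInjectionQueue:
    def __init__(self):
        self.pending = []   # list of [dest_mask, payloads] awaiting injection

    def inject(self, dest_mask, payload):
        for entry in self.pending:
            if entry[0] == dest_mask:     # match in the target mask:
                entry[1].append(payload)  # fold into the waiting packet and
                return                    # discard the newly arrived one
        self.pending.append([dest_mask, [payload]])

    def drain(self):
        ready, self.pending = self.pending, []
        return ready

q = AggregatingInjectionQueue()
q.inject(frozenset({1, 2, 5}), "lateral activation, cell (3, 7)")
q.inject(frozenset({1, 2, 5}), "inhibition overlap 12")  # merged, not re-sent
print(q.drain())   # one packet carrying both payloads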
F. Scalability
The biological system suggests that the best approach to increasing storage is to increase the number of columns and not the number of temporal cells (and distal segments) per column. From a practical perspective, if we increase the number of columns we can reduce the number of distal segments required per temporal cell. Although from the software perspective this merely seems interesting, from the hardware point of view it is really relevant, because it can reduce the cost of the interconnection and perhaps the complexity of each CC. Therefore, the present invention, in one of its embodiments, contemplates increasing the number of columns as much as the technology allows. Unfortunately, the communication system, as described so far, could scale only up to a limited number of columns and, clearly, the energy requirements will not scale using CMOS technology. This aspect is addressed below according to different embodiments of the invention, which introduce two complementary strategies:
1) Scalable proximal traffic: proximal patches
The software implementations of the algorithm known in the state of the art assume that the proximal synapses of a given column do not take the system topology into account. At boot time, the encoded input is potentially connected to a randomly chosen subset of the columns (by default, around 20%). During spatial pooling, the system learns the relevant inter-relationships according to the input sequence. Although from the software perspective this is beneficial, since it balances the use of the columns, for a hardware implementation it is very demanding. For example, using CCs with 5 columns and a subset of 20% of the columns, this approach implies that an axon activation in the encoder will require multicasting to all the CCs in the system. Certainly, this approach departs from the functioning of biological systems. From this perspective, the present invention chooses, according to one of its embodiments, to limit the potentially connected columns to a topologically restricted area of the network, which will be referred to as a proximal patch. Figure 6 shows an example of a proximal patch (60) in a honeycomb topology. According to this particular embodiment of the invention, the encoder is connected to the network through injection queues, where each bit is connected to a different router on the periphery of the circuit. For the sake of simplicity, Figure 6 illustrates this only for two separate bits, but note that the encoder will have thousands of output bits. In this way, the columns that could be connected are restricted to lie within the proximal patch.
The present invention, in one of its embodiments, defines the position of the patch randomly at boot time. Its size is a design parameter and can be redefined according to the nature of the input or the specific application. Experimentally, it has been observed that a 20% size rule is valid, although the dynamic increase or decrease of the patch can also be contemplated to balance the use of the columns.
Under such circumstances, when the input connected to the columnar module R1 (61) is activated, a multicast will be generated toward the CCs within the patch. The packet is injected into the network and behaves as unicast until it reaches the columnar module CC1 (62). The header information must include said columnar module (62) as an intermediate node, together with the multicast mask for the remaining nodes.
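A boot-time patch selection of this kind could be sketched as follows; the square patch shape and the sizing heuristic are illustrative assumptions on top of the ~20% rule mentioned above.

import random

def random_proximal_patch(mesh_side, patch_fraction=0.20):
    # Pick a contiguous square region of the mesh covering roughly
    # patch_fraction of all CCs, at a random position chosen at boot time.
    patch_side = max(1, round(mesh_side * patch_fraction ** 0.5))
    x0 = random.randrange(mesh_side - patch_side + 1)
    y0 = random.randrange(mesh_side - patch_side + 1)
    return {y * mesh_side + x
            for y in range(y0, y0 + patch_side)
            for x in range(x0, x0 + patch_side)}

patch = random_proximal_patch(mesh_side=8)   # ~20% of a 64-CC mesh
print(sorted(patch))   # the multicast mask for this encoder bit's packets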
2) Scalable distal and inhibition traffic
Similarly, the distal and inhibition traffic is assumed to be global in the software implementations (although the inhibition may be local). From the network perspective, the required delay and power will increase significantly as the number of CCs increases. It should be noted that the number of columns involved in the inhibition process is significantly greater than the number of active inputs in the encoder (any non-zero input overlap will require a multicast). However, biological systems certainly do not use global communication. From this hypothesis, the present invention, according to one of its embodiments, divides the network into separate zones and restricts the inhibition and distal traffic to within them. These regions will be referred to as scale-out zones.
Figure 7 graphically depicts how to increase the number of CCs from 16 to 64 in one of the embodiments of the invention. Instead of requiring complete multicasts or broadcasts, the traffic generated by the columns in any of the four represented zones (71-74) is restricted to travel inside that zone. If the number of columns needs to be increased even further, only the number of zones has to be increased, so that the traffic remains almost constant even if the system scales greatly.
The encoder, that is, the proximal traffic, is seen globally by the network and selects the potentially connected columns without making distinctions between the zones. This increases the number of available columns and, therefore, the capacity to represent values (precision). For example, with approximately 2K-4K columns per zone, this additional flexibility may not be useful enough to justify the memory requirements in the distal segments and the increased complexity of the encoder. The present invention, in one of its embodiments, contemplates using as many consecutive values of the encoded input sequence as there are scale-out zones. In this example, 4 encoders are used to simultaneously encode four different intervals of the input sequence. In this way, not only the throughput of the system is increased, but also the load on each individual column. In addition, increasing the number of zones will keep the total proximal traffic constant (since each entry in the sequence activates a number of input bits proportional to the number of columns per zone).
Different embodiments of the present invention have been tested in simulators adapted to employ data structures and mechanisms appropriate for a feasible hardware implementation and to obtain accurate energy and performance results. The input SDR encoder has been implemented using a Mersenne Twister pseudo-random generator. Synthetic time series have been used for the input modeling, specifically periodic series of 32-bit integer data, generated from randomly defined polynomials (up to fourth degree, with randomly chosen coefficients). A time series is defined by twenty values of each polynomial. Each time series is repeated until it is learned by the system, which occurs when the number of elements in the sequence with no incorrectly predicted columns (i.e., no column bursts) is equal to half of all the data points. In this way, the system spends half of the time learning new sequences and the other half simply predicting them. Therefore, during half of the intervals, the additional traffic resulting from the column bursts, or the inhibition traffic generated by the appearance of a new input sequence, will occur. During the second half of the time, the system will have a stable representation of the input, which is very benign for the network traffic. The number of time series (i.e., polynomials) necessary to meet a 98% confidence interval is approximately 500. Finally, the classifier used in these tests is the simplest one of the NuPIC application, which provides an anomaly score as the fraction of columns with prediction failures.
Having presented the test conditions, some of the results obtained by the present invention are presented below, demonstrating that it manages to improve the delay and energy results of the system, while the scalability analysis demonstrates its feasibility for tens of thousands of CC columnar modules in the system.
In a first embodiment of the invention, the default configuration used by the NuPIC application is used: 2048 columns, with 32 temporal cells per column, and global inhibition. A 2D mesh topology is used, with a conventional basic router, deterministic DOR routing, a 4-cycle pipeline, and virtual cut-through flow control. Low-swing link wires are assumed, which require one clock cycle to transfer a flit from one router to another, and 1280-byte input buffers are included, without virtual channels. Note that, with this configuration, all the multicast traffic is transmitted to all the CCs of the system. Therefore, by using dimension-order replication in the intermediate routers, a deadlock-free network is obtained. The router incorporates the network drainage mechanism described above.
Figure 8 graphically represents the number of clock cycles, per interval, for different sizes of 2D square mesh, that is, the number of network cycles necessary to carry out the tasks of each input interval, following a sequential and a segmented approach for different network sizes (different numbers of columns per CC). As can be seen, up to 300 cycles can be saved by segmenting and overlapping the learning process. Another observation, which may not be intuitive, is the behavior of the segmented approach: when the size of the network is increased, the time required is hardly modified. The reason for this behavior is contention. In this case, the network receives a greater load (since the three phases of the communication overlap). Therefore, when the network size is reduced, the available bandwidth is reduced and, consequently, the contention grows. Under these conditions, it appears that the bandwidth advantage compensates for the increase in average distance. With the sequential approach, contention is only notable in the 4x4 mesh.
Figure 9 graphically depicts the number of clock cycles per interval for different 2D square meshes, using aggregated traffic (16-byte-wide links with 5-flit packets). This figure introduces traffic aggregation. Under the same conditions as the previous embodiment, the packet length is carefully modeled, assuming a 5-flit size with 16-byte-wide links, and adding another packet when that limit is exceeded by the grouping process. As can be seen in this figure, the benefits are remarkable: the system's communication needs for an interval can be processed in less than 60 network clock cycles. Regarding the configuration of the network, the result obtained reverses the previous observation about its size, since the traffic reduction is so drastic that contention is not present in any case. Therefore, under such a configuration, the dominant factor is the average distance of the network.
Figure 10 graphically depicts the number of clock cycles per interval for different 2D square meshes, with the segmented algorithm, traffic aggregation and proximal patches applied. In this embodiment, proximal patches have been introduced, which yields a significant benefit. In both cases, 20% of the columns in the system have been selected. With more than 5 columns in each CC, the uniform distribution implies a broadcast. Proximal patches reduce this significantly, especially when the network size grows (and the benefit of converting a broadcast into a localized multicast is larger). Proximal patches do not significantly change the anomaly score (that is, the probability of missing an activation of the column in the temporal pooling), which is around 6% at the end of the simulation in both cases.
Since contention has so little impact on the network, it seems reasonable to reduce the link bandwidth. The results above correspond to links 16 bytes wide, which is a fairly conventional size for many contemporary systems using on-chip networks. For example, for a 16x16 mesh the bisection bandwidth is approximately 512 GB/s, assuming a 1 ns clock cycle. Therefore, under such circumstances, it seems interesting to explore the effect of reducing the link width in order to reduce energy consumption and area.
Figure 11 graphically represents the number of clock cycles as the link width is varied. For this, a specific example is used in which the variation of the time required to process an interval can be seen when the link width varies from 16 bytes to 1 byte. At this point, in order to select the best network configuration, its delay and the cost of the temporal/spatial calculation logic in the CCs must be balanced, so it is important to consider at least the following three aspects: (1) learning, in both cases, is out of the critical path; (2) the spatial logic is quite simple (it only needs to calculate the overlap of the inputs) and, since it operates in parallel with the temporal logic, it will not be in the critical path of the circuit either; (3) the generation of lateral activity in the temporal cells, which is in the critical path of the algorithm, is dominated by the accesses to the memory where the distal segments are stored. Therefore, the number of accesses to said memory is a key element in all the logic which, according to different embodiments of the invention, can be structured in different ways. In any case, under the optimistic hypothesis that no more than 1 clock cycle per distal segment will be needed, that each column has 32 temporal cells and that each temporal cell requires an average of 1 segment, 32 clock cycles will be required to process a single column. In a 2K-column system, like the one used for these results, the computation will require approximately from 4000 cycles (32 · 2048/16) in a 4x4 network down to 64 clock cycles in a 32x32 network. Therefore, in such cases, the most appropriate network is a 256-node network with 2-byte-wide links. The scalability of the network delay will allow an adequate adjustment with respect to the cost of the computing logic.
One of the main advantages pursued, and one that solves a major shortcoming of the state of the art, is to significantly increase the capacity of these systems and to offer real scalability, contemplating millions of columns. The present invention achieves these advantages, and proof of this is reflected in Figure 12.
In Figure 12, starting from a fixed network configuration that keeps the previous optimizations, the results are compared with those of a system with 4 scale-out zones, i.e. 8K columns in total, while keeping the network configuration unaltered. The figure shows the network clock cycles required to process one input interval. The time required is shown for two link widths, for different network sizes, and with and without scale-out zones. As can be seen, the delay is much less sensitive to the network diameter: a 32x32 mesh (M1024) with 2-byte-wide links can be used with results similar to those of the flat system (around 200 clock cycles). According to these results, inhibition and distal traffic dominate the network load, since the increase in proximal traffic at the destinations is negligible. Using 8 scale-out zones (i.e. 16K columns) yields the same results. It is therefore evident that the present invention offers a system whose communication delay is independent of the number of columns.
At this point, assuming a network clock cycle of 1 ns and that communication is the critical element of the algorithm, the present invention is capable of processing up to 100 million values of the input sequence per second. This performance is not attainable by any software approach: for example, the fastest current NuPIC implementation can process around 1,000 inputs per second (for a similar system size running on an average current machine).
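The rate follows directly from the cycle counts reported in the figures; the sketch below simply inverts the interval time, assuming the 1 ns network clock stated above.

```python
# Input values processed per second for a given communication cost per
# interval, assuming a 1 ns network clock and communication as the only
# critical element of the algorithm.
def inputs_per_second(cycles_per_interval: int, clock_ns: float = 1.0) -> float:
    return 1e9 / (cycles_per_interval * clock_ns)

for cycles in (10, 60, 200):
    print(f"{cycles} cycles/interval -> {inputs_per_second(cycles):.1e} inputs/s")
# 10 cycles per interval corresponds to the quoted 100 million values per second
```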
Increasing the number of CCs in the system increases the dynamic network power (each packet traverses more links and routers). Figure 13 shows the dynamic energy required by the network to process an interval. As can be seen, the use of scale-out zones reduces the network requirements by up to 3 times, making it possible to reduce, almost quadratically, the negative effects of network size.
To conclude this description, the performance, storage requirements and prediction efficiency of all the previous embodiments are detailed below. From a performance point of view, Figure 14 shows that up to 92% of the network effects can be overlapped with the other activities on the critical path.
Figure 15 shows how, in the same way, the dynamic energy of the network is reduced, confirming that the communication problems are almost non-existent and thereby reinforcing the idea of the present invention that a solution based on packet switching is a feasible proposal.
Figure 16 presents the probability of prediction failures per column for each experiment (each experiment comprises approximately 3 million different intervals). Some of the changes introduced slightly alter the original CLA algorithm, but in most cases the confidence margins indicate that the average behavior is similar (that is, there are no changes in the accuracy of the results). As noted above, the non-normalized value is around 6%. The proximal patches, however, seem to improve this figure slightly, to less than 5%; apparently, this biology-inspired traffic optimization is beneficial from the point of view of precision. As expected, the use of scale-out zones slightly decreases the accuracy of the system, back to the results obtained with the base algorithm.
Claims (14)
[1]
1. A hardware acceleration system for storing and retrieving information, which implements a cortical learning algorithm through a packet switching network, the system comprising:
at least one encoder module configured to encode a binary input into a sparse distributed representation (SDR), and to send, for each active bit of the SDR, a multicast packet to a given columnar module through the packet switching network, based on a previously established correspondence table;
a plurality of columnar modules connected by said packet switching network, configured to receive the multicast packets sent from the encoder module, where each of the columnar modules in turn comprises:
- a router with multicast support, configured to receive packets from the encoder module, deliver said packets to certain memory modules of the columnar module, and send packets from the memory modules to an output classifier;
- a plurality of memory modules configured to store the inputs received from the router and to store context information;
- a calculation module configured to determine a degree of overlap between the content of certain memory modules and the current input, select a specific number of memory modules with the greatest degree of overlap, determine a temporal context for each of the selected memory modules, make a prediction of the system output based on the current input and the temporal context information, and send an output packet containing said prediction to an output classifier module; and
an output classifier module configured to receive an output packet, sent through the packet switching network from any of the columnar modules, and to select a system output from a group of preset outputs based on the received output packet.
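Purely as an illustration of how the modules of claim 1 relate to each other, the data flow can be mimicked in software as follows; every class, method and value here is invented for the sketch, and the claim itself covers a hardware implementation.

```python
# Software analogy of claim 1 (all names are placeholders, not claim language).
from dataclasses import dataclass, field

@dataclass
class MemoryModule:
    stored_input: set = field(default_factory=set)   # bits received via the router

    def overlap(self, current_input: set) -> int:
        # degree of overlap between the stored content and the current input
        return len(self.stored_input & current_input)

@dataclass
class ColumnarModule:
    memories: list                                   # plurality of memory modules

    def calculate(self, current_input: set, k: int) -> set:
        # select the k memory modules with the greatest overlap; their indices
        # stand in for the prediction packet sent to the output classifier
        ranked = sorted(range(len(self.memories)), reverse=True,
                        key=lambda i: self.memories[i].overlap(current_input))
        return set(ranked[:k])

def classify(prediction: set, preset_outputs: dict) -> str:
    # output classifier: choose the preset output closest to the prediction
    return max(preset_outputs, key=lambda o: len(preset_outputs[o] & prediction))

col = ColumnarModule([MemoryModule({1, 2, 3}), MemoryModule({3, 4}), MemoryModule()])
pred = col.calculate(current_input={2, 3, 4}, k=2)
print(classify(pred, {"A": {0, 1}, "B": {1, 2}}))    # toy preset outputs -> "A"
```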
[2]
2. System according to claim 1, wherein the calculation module comprises a comparator, an adder and a counter.
[3]
3. System according to any of the preceding claims, wherein each memory module of the plurality of memory modules comprises a plurality of temporal cells that adopt an active or a non-active state, the combination of which represents a certain temporal context for the memory module.
[4]
4. System according to claim 3, wherein the calculation module is further configured to check whether its output prediction is correct; in case of a wrong prediction, a burst occurs that puts all the temporal cells of the memory module in the active state.
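As an illustrative note on claims 3 and 4 (not claim language), the temporal cells of a memory module can be pictured as a bit vector; the figure of 32 cells is taken from the description above, and the helper names are invented.

```python
# Temporal context as a 32-bit vector: each bit is one temporal cell, and the
# pattern of active cells identifies the memory module's temporal context.
CELLS = 32

def activate(context: int, cell: int) -> int:
    return context | (1 << cell)      # put one temporal cell in the active state

def burst(context: int) -> int:
    # claim 4: a wrong prediction bursts the module, activating every cell
    return (1 << CELLS) - 1

ctx = activate(activate(0, 3), 17)
print(f"{ctx:032b}")                  # two active cells encode this context
print(f"{burst(ctx):032b}")           # all ones after a wrong prediction
```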
[5]
5. System according to any of the preceding claims, wherein the calculation module is configured to combine stages and, given an input sequence, produce a prediction in three intervals of said sequence.
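One reading of claim 5, sketched below, is a three-stage pipeline in which the prediction for input i emerges three intervals later while subsequent inputs are already in flight; the stage split (spatial, temporal, classification) is an assumption, not claim language.

```python
# Toy three-stage pipeline illustrating one interpretation of claim 5.
from collections import deque

def run_pipeline(inputs):
    stages = deque([None, None, None], maxlen=3)   # three in-flight intervals
    predictions = []
    for x in list(inputs) + [None, None, None]:    # extra ticks drain the pipe
        finished = stages[-1]                      # leaves the final stage now
        stages.appendleft(x)                       # next input enters stage one
        if finished is not None:
            predictions.append(f"prediction({finished})")
    return predictions

print(run_pipeline(["x0", "x1", "x2"]))
# each prediction appears three intervals after its input entered the pipeline
```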
[6]
6. System according to claim 5, wherein the calculation module is further configured to aggregate traffic from different stages in the same packet.
[7]
7. System according to any of the preceding claims, wherein the columnar modules at the ends of the network are configured to inject a broom packet into the packet switching network, which is replicated in the rest of the columnar modules only when the corresponding router has no more queued packets, until said broom packet reaches the opposite end of the network, indicating that the network has been emptied.
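The broom mechanism of claim 7 can be modeled abstractly as follows; the one-packet-per-cycle drain rate and the single-row topology are simplifying assumptions for the sketch.

```python
# Abstract model of claim 7: the broom advances along a row of routers, but a
# router forwards it only once its own queue is empty, so the broom reaching
# the far end certifies that the whole path has drained.
def sweep(queues: list) -> int:
    """Cycle at which the broom exits, given pending packets per router."""
    cycle = 0
    for pending in queues:
        cycle = max(cycle, pending)   # wait until this router's queue is empty
        cycle += 1                    # then replicate/forward the broom onward
    return cycle

print(sweep([0, 3, 1, 0]))            # 6: the broom leaves once all queues drain
```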
[8]
8. System according to any of the preceding claims, wherein the number of memory modules comprised in each of the columnar modules is determined by a balance between the propagation delay and the system clock cycle.
[9]
9. System according to any of the preceding claims, wherein the at least one encoder module is configured to send an input packet to a selection of randomly preset columnar modules representing around 20% of the total columnar modules.
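A correspondence table with the property of claim 9 could be built offline along the following lines; the fraction, seed and sizes are assumptions for illustration only.

```python
# "Randomly preset" correspondence table: each encoder bit targets a fixed,
# randomly chosen ~20% of the columnar modules, selected once at setup time.
import random

def build_table(n_bits: int, n_columns: int, fraction: float = 0.20, seed: int = 42):
    rng = random.Random(seed)             # fixed seed keeps the table "preset"
    k = max(1, round(fraction * n_columns))
    return {bit: rng.sample(range(n_columns), k) for bit in range(n_bits)}

table = build_table(n_bits=128, n_columns=2048)
print(len(table[0]))                      # 410 targets, i.e. ~20% of 2048
```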
[10]
10. System according to any of the preceding claims, implemented on a silicon wafer, a chip or a microprocessor using CMOS technology.
[11]
11. Scalable hardware acceleration method for storing and retrieving information through a packet switching network, the method comprising the steps of:
a) encode, in an encoder module, a binary input into a sparse distributed representation (SDR);
b) send, for each active bit of the SDR, a multicast packet from the encoder module to a given columnar module of a plurality of columnar modules through the packet switching network, based on a previously established correspondence table;
c) receive the packets sent from the encoder module, through the packet switching network, in a router of a columnar module;
d) deliver said packets to certain memory modules of the columnar module;
e) store the received packets in said memory modules;
f) determine, in a calculation module of the columnar module, a degree of overlap between the contents of the memory modules that have received the input packet and the current input;
g) select, by the calculation module, a certain number of memory modules with the greatest degree of overlap;
h) determine, by the calculation module, a temporal context for each of the selected memory modules;
i) make, by the calculation module, a prediction of the system output as a function of the current input and the temporal context information stored in the memory modules;
j) send an output packet containing said prediction to an output classifier module;
k) receive the output packet in the output classifier, sent through the packet switching network from any of the columnar modules;
l) select, in the output classifier, a system output from a group of preset outputs based on the received output packet.
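As a non-normative companion to the method claim, the following sketch traces steps a) through l) end to end with toy data structures; every name and number is a placeholder, not the claimed hardware method.

```python
# End-to-end trace of steps a)-l) of claim 11 (illustrative only).
import random

random.seed(0)
N_COLUMNS = 8
TABLE = {bit: random.sample(range(N_COLUMNS), 2) for bit in range(16)}  # step b)
PRESET_OUTPUTS = {"A": {0, 1, 2}, "B": {5, 6, 7}}                       # step l)

def process_interval(sdr: set, stored: list, k: int = 2) -> str:
    received = {c: set() for c in range(N_COLUMNS)}
    for bit in sdr:                      # a) the input, already encoded as an SDR
        for col in TABLE[bit]:           # b)-d) multicast routed via the table
            received[col].add(bit)       # e) store the received packets
    # f)-g) overlap of the stored contents with the current input; keep k winners
    overlaps = {c: len(stored[c] & received[c]) for c in received}
    winners = sorted(overlaps, key=overlaps.get, reverse=True)[:k]
    prediction = set(winners)            # h)-i) stands in for context + prediction
    # j)-l) the classifier selects the closest preset output
    return max(PRESET_OUTPUTS, key=lambda o: len(PRESET_OUTPUTS[o] & prediction))

stored = [set(range(16)) for _ in range(N_COLUMNS)]  # toy stored inputs per column
print(process_interval({1, 4, 9}, stored))
```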
[12]
12. Method according to claim 11, further comprising checking whether the output prediction made by the calculation module is correct, where, in case of a wrong prediction, a burst is produced that puts all the temporal cells of the memory module in the active state.
[13]
13. Method according to any of claims 10-12, further comprising verifying that the packet switching network is empty before executing the step of calculating the overlap and before determining the temporal context, where, to verify that the network is empty, a broom packet is provided that traverses the packet switching network.
[14]
14. Method according to any of claims 10-13, further comprising the step of restricting the packets sent by the encoder module to a selection of randomly preset columnar modules representing around 20% of the total columnar modules.
Family patents:
Publication number | Publication date
WO2017085337A1 | 2017-05-26
ES2558952B2 | 2016-06-30
Legal status:
2016-06-30 | FG2A | Definitive protection | Ref. document: ES 2558952 B2 | Effective date: 2016-06-30
Priority applications:
Application number | Filing date | Patent title
ES201500841A | 2015-11-20 | Scalable system and hardware acceleration method to store and retrieve information
PCT/ES2016/000125 | 2016-11-17 | Scalable hardware-based acceleration system and method for storing and recovering information